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ABSTRACT 

I introduce an algorithm for estimating parameters from multidimensional data based on for- 
ward modelling. It performs an iterative local search to effectively achieve a nonlinear inter- 
polation of a template grid. In contrast to many machine learning approaches it avoids fitting 
an inverse model and the problems associated with this. The algorithm makes explicit use 
of the sensitivities of the data to the parameters, with the goal of better treating parameters 
which only have a weak impact on the data. The forward modelling approach provides uncer- 
tainty (full covariance) estimates in the predicted parameters as well as a goodness-of-fit for 
observations, thus providing a simple means of identifying outliers. I demonstrate the algo- 
rithm, ILIUM, with the estimation of stellar astrophysical parameters (APs) from simulations 
of the low resolution spectrophotometry to be obtained by Gaia. The AP accuracy is com- 
petitive with that obtained by a support vector machine. For zero extinction stars covering 
a wide range of metallicity, surface gravity and temperature, ILIUM can estimate T G fj to an 
accuracy of 0.3% at G=15 and to 4% for (lower signal-to-noise ratio) spectra at G=20, the 
Gaia limiting magnitude (mean absolute errors are quoted). [Fe/H] and log g can be estimated 
to accuracies of 0. 1-0.4 dex for stars with G < 18.5, depending on the magnitude and what 
priors we can place on the APs. If extinction varies a priori over a wide range (0-10 mag) - 
which will be the case with Gaia because it is an all sky survey - then log g and [Fe/H] can 
still be estimated to 0.3 and 0.5 dex respectively at G=15, but much poorer at G=18.5. T e g 
and Ay can be estimated quite accurately (3-4% and 0.1-0.2 mag respectively at G=15), but 
there is a strong and ubiquitous degeneracy in these parameters which limits our ability to 
estimate either accurately at faint magnitudes. Using the forward model we can map these de- 
generacies (in advance), and thus provide a complete probability distribution over solutions. 
Additional information from the Gaia parallaxes, other surveys or suitable priors should help 
reduce these degeneracies. 

Key words: surveys - methods: data analysis, statistical - techniques: spectroscopic - stars: 
fundamental parameters - ISM: extinction 



1 INTRODUCTION 

Inferring parameters from multidimensional data is a common task 
in astronomy, whether this be inference of cosmological parame- 
ters from CMB experiments, photometric-redshifts of galaxies or 
physical properites from stellar spectra. Important questions about 
the structure and evolution of stars and stellar populations require 
knowledge of abundances and ages, which must be obtained spec- 
troscopically via the stellar atmospheric parameters effective tem- 
perature (Tea), surface gravity (logg) and iron-peak metallicity 
([Fe/H]). 
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Numerous publications have presented methods for estimat- 
ing such astrophysical parameters (APs) from spectra. The methods 
can be divided into two broad categories. The first, based on high 
resolution and high signal-to-noise ratio (SNR) spectra, makes use 
of specific line indices selected to be sensitive predominantly to 
the phenomena of interest. Examples include the detection of BHB 
stars via Calcium and Balmer lines (e.g. Brown 2003 1 and spectral 
type classification of M, L and T dwarfs via molecular band indices 
(e.g. Hawley et al. 2002). In each case these methods use a specific, 
identified phenomenon and soare generally limited to a narrow part 
of the AP space where the values of the other APs are reasonably 
well known. While simple and useful, these methods do not use all 
available information nor can they normally be used with low res- 
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olution data, because the effects of individual APs cannot then be 
separated. 

The second category of methods is global pattern recognition, 
which try to use the full set of available data. These are gener- 
ally used when we want to estimate one or more APs over a wide 
range of parameter space. In such cases there is rarely a simple 
relation between a single feature and the physical quantity of in- 
terest. (An example is determination of surface gravity, where the 
relevant lines to use change with temperature.) We must therefore 
infer which features are relevant to which APs in which parts of 
the AP space. As this is generally a nonlinear, multidimensional 
problem, the standard approach is to use a machine learning algo- 
rithm to learn the mapping from the data space to the AP space 
based on labelled template spectra (spectra with known APs). Var- 
ious models have been used in astronomy for a variety of prob- 
lems, including (to name just a few): neural networks for stellar 
parameter estimation (e.g. Re Fiorentin et al. 12007b or photomet- 
ric redshift estimation (e.g. Firth et al. 2003); support vector ma- 
chines for quasar classification (e.g. Gao et al. 120031 Bailer- Jones 
et al. 120081 ) or galaxy morphology classification (e.g. Huertas- 
Company et al. 2008); classification trees for identifying cosmic 
rays in images (e.g. Salzberg et al. U995t ; linear basis function pro- 
jection methods (e.g. the method of Recio-Blanco et al. [2006 for 
spectral parameter estimation, which constructs the basis functions 
from model spectra). More examples of machine learning methods 
and their use in astronomy can be found in the volume edited by 
Bailer-Jones ( 12008b . 

Note that the first category of methods, line indices, is really 
just a special case of the second in which drastic feature selection 
has taken place to enable use of low (one or two) dimensional mod- 
els. In both cases we must learn some relationship AP = (/(Data) 
from a model or based on some labelled templates. But this is an 
inverse relation: more than one set of APs may fit a given set of data 
(e.g. a low extinction cool star or high extinction hot star could pro- 
duce the same colours). Despite this non-uniqueness we nonethe- 
less try and fit a unique model. This causes fitting problems which 
become more severe the lower the quality of the data (the lower the 
number of independent measures) and the larger the number of APs 
we want to estimate from it, and could lead to poor AP estimates 
or biases. In contrast, the forward mapping (or generative model), 
Data = <?(AP) is unique, because this is a causal, physical model 
(e.g. a stellar atmosphere and radiative transfer model of a spec- 
trum). A further issue is that the model must learn the sensitivity 
of each input to each AP (and how noise affects this). Yet this in- 
formation we in principle have already from the gradients of the 
generative model. 

A pattern recognition method which tries to overcome some 
of these problems is fc-nearest neighbours, which has also been ap- 
plied to many problems in astronomy (e.g. Katz et al. 119981 Ball 
et al. [2008 ; plus extensions thereof such as kernel density estima- 
tion, e.g. Richards et al. 120041 or the method of Shkedy et al. 120071 
which converts the distances into likelihoods and uses priors to cre- 
ate a full probabilistic solution.) In many ways this is the most natu- 
ral way to solve the problem: we create a grid of labelled templates 
and find which are closest to our observation (perhaps smooth- 
ing over several neighbours). On the assumption that the APs vary 
smoothly with the data between the grid points, this may provide a 
good estimate. But for this to be accurate (and not too biased), the 
grid must be sufficiently dense that multiple grid points lie within 
the error ellipse of the observation (the covariance of these neigh- 
bours then provides a measure of the uncertainty in the estimated 
APs). If the SNR is high, the grid must therefore be very dense. 



Moreover, as the number of APs increases, the required grid den- 
sity grows exponentially with it: For a stellar parametrization prob- 
lem with 5 APs we might need an average of 100 samples per AP, 
resulting in 100 5 = 10 10 templates. There is also the issue of what 
distance metric to use. The covariance-weighted Mahalanobis dis- 
tance is often used, but it ignores the sensitivities of the inputs. (If 
some inputs are sometimes dominated by irrelevant cosmic scatter, 
this will add unmodelled noise to the distance estimate.) This is a 
problem when we have a mix of APs, some of which have a large 
and others a small impact on the variance ("strong" and "weak" 
APs). If we just use the Mahalanobis distance we loose sensitivity 
to the weak APs. 

We could overcome these problems if we did on-the-fly inter- 
polation of the template grid to generate new templates as we need 
them. Running stellar models is far too time consuming for this, 
but also unnecessary because the generative model is smooth: We 
can instead fit a forward model to a low density grid of templates 
as an approximation to the generative model. As we shall see, pos- 
session of a forward model opens up opportunities not available 
to the inverse methods, such as direct uncertainty estimates and 
goodness-of-fit assessment of the solution. 

In this article I introduce an algorithm for AP estimation based 
on this forward modelling idea and iterative interpolation. I will 
demonstrate it using simulations of low resolution spectra to be ob- 
tained from the Gaia mission (e.g. Lindegren et al. 2008)Q Gaia 



will observe more than 10 9 stars down to 20*^ magnitude over 
the whole Galaxy, stars which a priori span a very wide range in 
several APs. This includes the line-of-sight extinction parameter, 
Ay, which must be estimated accurately if we want to derive in- 
trinsic stellar luminosities from the Gaia parallaxes. AP estimation 
(Bailer- Jones 2005 1 is therefore an integral part of the overall Gaia 
data processing and comprises one of the Coordination Units in the 
Gaia Data Processing and Analysis Consortium (DPAC) (Mignard 
& Drimmel l2007l O'Mullane et al. l2007l . 

I will now describe the basic algorithm (section|2](. In section|3] 
I then introduce the simulated Gaia spectroscopy to which ILIUM 
is applied, with the results and discussion thereof presented in sec- 
tions|4]|5]and|6] The latter section also reports on a strong and ubiq- 
uitous degeneracy between T c ff and Av. I summarize and conclude 
in section[8] Additional plots, results and discussions can be found 
in a series of four Gaia technical notes (Bailer-Jones 2009a,b,c,d) 
available from|http : / /www .mpia . de/ Gaia| 



2 THE ilium ALGORITHM 

I outline the algorithm using the terminology of spectra and stellar 
astrophysical parameters, although it is quite general and applies 
to any multivariate data. Table [T] summarizes the notation. "Band" 
refers to a flux measurement in the spectrum. In general it could be 
a photometric band, a single pixel in the spectrum, or a function of 
many pixels. 



2.1 Forward modelling 

I will call the true relationship between the APs and the flux in 
a band i the generative model, gi(4>). This provides the observed 
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Table 1. Notation 



number of bands (pixels in spectrum) 
counter over band, i = 1 . . . I 
photon counts in band i (p is a spectrum) 
number of APs (astrophysical parameters) 
counter over AP, j = 1 . . . J 
AP j (4> is a set of APs) 
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Figure 1. Sketch of the search method for one band (7=1) and one AP 
(J=l). The dashed blue curve is the (unknown) generative model, and the 
solid blue curve is the forward model (our approximation to the generative 
model) formed by fitting a function to the templates. (The difference be- 
tween the two is exaggerated.) The straight red line shows the local linear 
approximation of the forward model (tangent at <f>i) used to calculate the 
first AP step. (In the case show the forward model could be inverted. But 
this is not generally the case, not even in one dimension if it has a turning 
point.) 



spectrum for a given set of APs and thus encapsulates the underly- 
ing stellar model, radiative transfer, interstellar extinction, instru- 
ment model etc. (There is a separate function for each band, but for 
simplicity I will refer to it in the singular.) Generally we don't have 
an explicit function for this model, so it remains unknown. All we 
have for doing AP estimation is a discrete template grid of exam- 
ple spectra with known APs generated by the generative model. We 
approximate the generative model using a forward model, fi(4>), 
which is a (nonlinear) parametrized fit to this grid and provides 
flux estimates at arbitrary APs, i.e. off the grid. (Forward models 
can be fit independently for each band.) Demanding the forward 
models to be continuous functions ensures we can also use them to 
calculate the sensitivities, which by definition are the gradients of 
the flux with respect to each AP. The forward model fitting is done 
just once for a given grid and is kept fixed when predicting APs. In 
other words it is a training procedure. 



2.2 Core algorithm 

The basic idea of ILIUM is to use the Newton-Raphson method to 
find that forward model-predicted spectrum (and associated APs) 
which best fits the oberved spectrum. 

In detail, the algorithm is as follows (Fig.[T]l. Consider first a 
single AP and single band. The measurement is p(0) and we want 



to estimate its AP. The forward model, p — f(<f>), has been fit 
and remains fixed. The procedure is as follows (n is the iteration 
number) 

(i) Initialize: find nearest grid neighbour to p(0), i.e. the one 
which minimizes the sum-of-squares residual 8p T 8p. Call this 

0(1)]. 0(1) is the initial AP estimate. 

(ii) Use the forward model to calculate the local sensitivities, 
J?, at the current AP estimate. 

(iii) Calculate the discrepancy (residual) between the predicted 
flux and the measured flux, Sp(n) = p(n) — p(0). 

(iv) Estimate the AP offset as 5<t>(n) — ( 

i.e. a Taylor expansion truncated to the linear term. (Note that this 
partial derivative is the reciprocal sensitivity.) 

(v) Make a step in AP space, <j>(n + 1) = <f>(n) — 6<f>(n), toward 
the better estimate. This is the new AP prediction. 

(vi) Use the forward model to predict the corresponding (off- 
grid) flux, p(n + 1) 

(vii) Iterate steps ii-vi until convergence is achieved or a stop is 
imposed. 

The algorithm is basically minimizing \Sp\. At each iteration we 
obtain an estimate of the APs (step v) and the corresponding spec- 
trum (step vi). Convergence could be defined in several ways, e.g. 
when changes in the spectrum or the APs (or their rate of change) 
drop below some threshold. Alternatively we could simply stop af- 
ter some fixed number of iterations. There is no guarantee of con- 
vergence. For example, if the AP steps were sufficiently large to 
move to a part of the function with a sensitivity of the opposite 
sign, then the model could diverge or get stuck in a limit cycle. 
Likewise, if initialized too far from the true solution the algorithm 
could become stuck in a local minimum far from the true solution. 
For this, and other reasons, the algorithm in practice has some ad- 
ditional features (discussed in section[23]l. Also, 



2.3 Generalization to multiple APs and bands 

In general we have several bands and several APs. The flux pertur- 
bation due to small changes in the APs is then 



5p 



(1) 



where S is the I x J sensitivity matrix with elements sy = 
dpi/dcpj. Note that I > J. Multiplying this equation on the left 
by (S T S)- 1 S T gives 

5<p = (S T S)" 1 S T 5p 



so the AP update equation (step v in the algorithm) becomes 
<p{n + 1) = cf>(n) - (S T Sy 1 S T 5p(n) 



(2) 



(3) 



The / forward models are now functions of J variables, and this 
turns out to be a critical matter. 



2.4 The forward model 

The core algorithm just described can make use of any form for 
the forward model, on the condition that it provides values of the 
function and its first derivatives for arbitrary values of the APs. 

The most obvious forward model would be a multidimen- 
sional, nonlinear regression of the form p = /(</>), which in prin- 
ciple works for any number of APs. However, I found that it was 
difficult to get a model which simultaneously fits both T c g , a strong 
AP, and log g, a weak AP to sufficient accuracy. Strong here means 
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Figure 2. Schematic diagram of a two-component forward model. Note the 
stronger variation in the flux in the direction of the strong AP (the contrast 
is typically much larger for the problems considered in this article). 




strong AR k 



S d 




Figure 3. Schematic diagram of the grid of AP values, k = 1 . . . 20, I = 
1 ... 7. The solid (red) points denote those used to fit the weak component 
of the forward model for one value of the strong AP. 



that it explains much of the variance in the flux data, i.e. is a strong 
predictor of the flux. Weak is a relative term, indicating that the AP 
explains much less of the variance. The reason for poor fits over the 
weak APs is that the model is fit by minimizing a single objective 
function, namely the error in reproducing the flux. As the weak AP 
has very little impact on the flux, its influence has little impact on 
the error, so the model optimization does little to produce a good fit 
over this AP. 

To overcome this problem I use a two-component forward 
model to separately fit the strong and weak APs. Consider the case 
of a single strong AP and a single weak AP (this is generalized 
later). The strong component is a ID nonlinear function of the 
strong AP which is fit by marginalizing over the weak AP. This 
fits most of the flux variation. Then, at each discrete value of the 
strong AP in the grid, we fit the residual flux as a function of the 
weak AP; these are the second components (also ID). They provide 
a flux increment dependent on the weak AP, which is added to the 
flux predicted given the strong AP. A schematic illustration of such 
a two-component forward model is illustrated in Fig. [2] 



Figure 4. Schematic illustration of the two components of the forward 
model fit over all the strong and weak points in the grid in Fig. [5] Top: 
the fit over the strong APs. Bottom: one of the fits over the weak APs (the 
solid/red points in the top panel at k=14). 



To be more precise the model is fit as follows. Let (f> s denote 
a strong AP, 4> w a weak AP and /,(0 S , 4> w ) the complete (2D) 
forward model for band i. Let subscripts k and I denote specific 
values in the grid of the strong and weak APs respectively (see 
Fig.[3j. We model the flux at an arbitrary AP point as 



fr 



= fi 



+ 



(4) 



where both ff and //^ are ID functions. ff(4> S ) is the single 
strong component (for band i). It is a fit to the average value of 4> w 
at each (j> s (the curve in the top panel of Fig. ffj. is the k th 
weak component, which is a fit with respect to the weak AP with 
the strong AP fixed at <j>f (e.g. the solid/red points in Fig. [3}, i.e. it 
is fit to the residuals 



=<t>%)-p{<t> 



as illustrated in in the bottom panel of Fig. [4] 

This fitting approach clearly requires us to have a "semi- 
regular" grid: one which has a range of values of the weak AP for 
each value of the strong AP. (This requirement is easily fulfilled 
when using synthetic grids.) The number of weak components in 
the model is equal to the number of unique values of (f> s , 20 in this 
schematic case. If we have more than one strong or weak compo- 
nent then we raise the dimensionality of the strong or weak com- 
ponent (see section|6]l. 

Applying the forward model is easy. Given (0 s , <f> w ) we 



(i) evaluate /; , the strong component; 

(ii) find the nearest value, <f>f. 
closest weak component; 

rW 



in the grid to cf> , i.e. identify the 



(iii) evaluate f i k , the increment from the weak component; 
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(iv) sum the two components, ff + fj^., to give the forward 
model prediction. 

Although the weak component we use changes discontinuously as 
4> s varies, the weak component is only specifying an increment to 
the strong component fit. As both components are smooth in their 
respective APs the combined function is also smooth along any di- 
rection parallel to the AP axes. It is not smooth along arbitrary di- 
rections, but this is unimportant because explicit calculations with 
the forward model (e.g. of the sensitivities) are only carried out 
parallel to the AP axes. 



2.5 Practical algorithm 

The practical realities of working with real (noisy) data mean that 
the basic algorithm should be extended in order to make it more 
robust. These I now describe together with other implementation 
aspects used for the experiments described later. For thepurposes 
of this paper the algorithm has been implemented in Frl A Java 
implementation is in progress, which will be necessary for larger 
scale applications. 



data with a similar noise level as the target data (e.g. Snider et 
al. l2001t .This precipitates the need for multiple models when used 
on survey data with a range of SNRs, something which is not an 
issue for ILIUM. 



2.5.3 Sensitivity estimation 

If the forward model is a simple analytic function then it may have 
analytical first derivatives which can be used to calculate the sen- 
sitivities. But in the general case we can use the method of first 
differences 



dp 



f{4> + <%) - 



(5) 



I select S(j>j to be slightly smaller than the maximum precision a 
priori possible in an AP. As the forward model must be smooth 
at this resolution, the first difference approximation is sufficiently 
accurate. For the examples shown later, I choose 5<f)j to be 0.05 
dex for log 5 and [Fe/H], 0.0005 for log(T cff ) (0.1% for T cff ) and 
0.03 mag for Ay. 



2.5.1 Standardized variables 

The spectral variables are observed photon counts (to within an ir- 
relevant constant factor). The APs are all on logarithmic scales: Av 
in magnitudes, log g and [Fe/H] in dex, and log(T e ff). In order to 
bring each variable to the same level I standardize the flux in each 
band and each AP (linearly scale each to have zero mean and unit 
variance). If there were correlations in the spectra these could be re- 
moved by "sphereing" ("prewhitening"), a covariate generalization 
of standardization which gives the data unit diagonal covariance 
(e.g. Bishop l2006l , 



2.5.4 Lower limit on sensitivities (singularity avoidance) 

Given that the AP updates depend upon the inverse of the sensitivity 
matrix (equation |2|, it is prudent to prevent the sensitivities being 
too small in order to avoid S T S being singular. For this reason, a 
lower limit is placed on the absolute value of each sensitivity, Sij, 
of 0.001 (with p and <f> in standardized units). For the examples 
shown later, this limit rarely had to be applied in practice and had 
negligible impact on the results. To avoid singularity S T S must 
also have a rank of at least J, i.e. to estimate J APs we need at 
least J independent measures in the spectrum. 



2.5.2 Forward model functions 

The strong and weak components of the forward model are fit us- 
ing smoothing splines (e.g. Hastie et al. 12001 1 . Conventional cu- 
bic splines have the drawback that one must control their com- 
plexity (smoothness) using the number and position of the knots. 
Smoothing splines circumvent this problem by setting a knot at 
every point (which would overfit the data) and then applying a 
smoothing penalty which is controlled by specifying the effective 
degrees-of-freedom (dof). I set this by trial and error via inspec- 
tion of the resulting fits. For the 2D problems TG (T e g and log g) 
and TM (T e g and [Fe/H]) described in sections [4] and [5] both the 
strong and the weak model splines are ID. The strong model uses 
dof= nTcff/2 = 16.5 where nTeff is the number of unique T e g 
points. As the maximum number of log g points is 10 (for the train- 
ing data), and because the variation with log g is quite smooth, I set 
the dof for these fits to be 4. However, many of the T e g values in the 
training grid have fewer log g or [Fe/H] points: To avoid overfitting, 
if nio gg (T , off ) < 4 then a linear fit is used. If n\ os& (T e s) = 1, then 
no fit is performed and this weak component of the forward model 
is zero. (Practical aspects of higher dimensional forward models 
are described in section[6]) 

The forward model is always fit to noise-free data. It is 
well known (and the author's experience) that inverse modelling 
methods such as ANNs and SVMs perform best when trained on 
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2.5.5 AP update contribution clipping 
Equation|2]can be written 

4>(n + 1) = 4>(n) - M Sp(n) (6) 
where 

M = (S T S)" 1 S T (7) 

is a J x J matrix. Equation|6]gives J update equations, one for each 
AP. The update for AP j can be written as the dot product of two 
vectors, the j th row of M, m 3 with p(n), i.e. 

0j(n + l) = (f)j(n) - rrij Sp(n) 

= 4>j{ n ) — mij5pi(n) 

i 

= 4>j{n) -^Wj-(n) (8) 

i 

which defines Uij . Thus we see that the update to AP j is a sum over 
/ individual updates, which we can view as an update "spectrum". 
If we inspect these updates (see Bailer- Jones 20093 we see mat on 
occasion some are much larger than the others. This dominance of 
the update by just one of a few elements is undesirable, because 
they may be affected by noise (5p(n) is a noisy measurement). For 
this reason, I clip outliers in this spectrum. (It is valid to compare 
the updates for different bands, because we work with standardized 
fluxes.) To be robust, I set an upper (lower) limit which is a multiple 
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c of the median of those points above (below) the median. Using the 
notation 9Q to denote median, the limits are 

Muppci = 0{Ui) + c[6(Ui > 9(Ui)) - 6{Ui)) 

Ulowor = 6(Ui) + c[6( Ui < 6( Ui )) - 0(Ui)} (9) 

I somewhat arbitrarily set c = 10 so as to be relatively conservative 
in clipping. 

2.5.6 Upper limit on AP update step size 

The AP steps at each iteration (equation [2]( could be very large. 
This is undesirable, because the updates are based on a local linear 
approximation to the generative model. The code therefore imposes 
upper limits on the AP updates, corresponding to steps no larger 
than about 2.0 dex in logp and [Fe/H], 0.04 in log(T eff ) (10% in 
T e g) and 0.3 mag in Av. A larger step size is permitted for the 
weaker APs because the initial nearest neighbour offset can be quite 
incorrect. These limits are imposed more often for noisy data, but 
still relatively rarely. 

2.5.7 Limit AP extrapolation 

We do not expect the forward model to make good predictions be- 
yond the AP extremes of the grid, so I set upper and lower limits 
on the AP estimates which ILIUM can provide. These are set as e 
times the range of each AP, i.e. 

upper limit = max^j + e(max0j — min </>.,) 
lower limit = min (j>j — e(max (f>j ■— min cj>j) (10) 

Isete= 0.1. 

2.5.8 Stopping criterion 

The algorithm is simply run for a fixed number of iterations (20). 
We often observe good natural convergence, so a more sophisti- 
cated stopping criterion is not applied at this time, although it may 
be important on high variance data sets (low SNR or more APs). 



as a function of the sensitivity (calculated at the estimated APs) 
and the covariance in the measured photometry, Cp. (This equa- 
tion assumes that ILIUM provides unbiased AP estimates and that 
the sensitivities have zero covariance. It can also be written = 
MC P M T where M is the update matrix introduced in equation]?]) 
C p can be estimated from a photometric error model, and will be 
diagonal if the photometric errors in the bands are independent. 
Even in this case C4, is generally non-diagonal: the AP estimates 
are correlated on account of the sensitivities. 

Because we have a forward model we can calculate a 
goodness-of-fit (GoF) for any estimate of the APs. Here I simply 
use the reduced-x 2 to measure the difference between the observed 
spectrum and predicted spectrunj^] 

where pi — fi{(f>j) is the forward model prediction and {crp\} = 
diag(Cp) is the expected photometric noise. (Despite the name, a 
larger value refers to a poorer fit!) As the GoF can be measured 
without knowing the true APs, it can be used for detection of poor 
solutions or outliers. 

Conventional methods of AP estimation via direct inverse 
modelling (e.g. with SVMs or ANNs) do not naturally provide un- 
certainty estimates and must usually resort to time-intensive sam- 
pling methods, such as resampling the measured spectrum accord- 
ing to its estimated covariance. They cannot provide a GoF at all 
because they lack a forward model. 

2.8 Signal-to-noise weighted AP updates 

The update equation ([2j only takes into account the sensitivity of 
the bands, not their SNR. However, even if a band is very sensitive 
to an AP in principle, if its measurement is very noisy then it is 
less useful. We could accommodate this by including a factor pro- 
portional to Cp 1 into equation|2]which would down- weight noisier 
measurements. Preliminary results using this on the TG problem 
(see section [3~2| l show it actually degrades performance at G=15, 
but gives some improvement at G=18.5 (Bailer- Jones 2009c). 



2.6 Performance statistics 

The model performance is assessed via the AP residuals (estimated 
minus true, 5(j>) on an evaluation data set. I report three statistics 
: (1) the root-mean-square (RMS) error, which I abbreviate with 
e rms ; (2) the mean absolute error, \5(f>\, abbreviated as e mas ; (3) the 
mean residual, 5(f), a measure of the systematic error, abbreviated 
as e syE . (Note that as (1) and (2) are statistics with respect to the 
true values they also include any systematic errors.) I mostly use 
e mac rather than RMS because the former is more robust. If the 
residuals had a Gaussian distribution then the RMS would equal 
the Gaussian la which is \fnj i 2 = 1.25 times larger than e mae - 
But usually there are outliers which increase the RMS significantly 
beyond this. 

2.7 Uncertainty estimates 

If vectors y and x are related by a transformation y = Aa; then 
a standard result of matrix algebra is that the covariance of y is 
C y — AC X A T where C x is the covariance of x. Applying this to 
equation|2]gives us an expression for the covariance in the APs 

C = (S T S)" 1 S T CpS(S T S)" 1 (11) 



3 ASSESSING THE ALGORITHM 
3.1 Gaia simulations 

To illustrate ILIUM I apply it to estimate stellar APs from simu- 
lated Gaia stellar spectra and thereby also make preliminary pre- 
dictions of the mission performance. Gaia will observe all of its 
targets with two low-dispersion slitless prism spectrographs, to- 
gether covering the wavelength range from 350-1050 nm. (These 
are creatively called "BP" for blue photometer and "RP" for red 
photometer.) The dispersion varies from 3 nm/pixel at the blue end 
to 30 nm/pixel at the red end (Brown 2006). The blue and the red 
spectra are each sampled with 60 pixels, but as the line-spread- 
function of the spectrograph is much broader, these samples are 
not independent. After removing low SNR regions of the modelled 
spectra, I retain 34 pixels in BP covering 338-634 nm and 34 pixels 
in RP covering 667-1035 nm. This is a slightly narrower range (and 
18 pixels fewer) than the one adopted by Bailer- Jones et al. (2008) 
for quasar classification with similar spectra. 

3 I use I — 1 degrees of freedom rather than / because all the spectra have 
a common G magnitude, so the bands are not all strictly independent. 
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Figure 5. The T c ff-log g grid of the data used in the experiments. 



Extensive libraries of BP/RP spectra have been simulated by 
the Gaia DPAC using the GOG (Gaia Object Generator; Luri et 
al. (2005}, Isasi[2p09 l instrument model and libraries of input spec- 
tra]^] Here I use the Basel (Lejeune et al. 11997b and Marcs (e.g. 
Gustafsson et al. 2008 1 stellar libraries. The former (as used here) 
includes 17 T e ft values from 8000-15 000 K with non-uniform 
spacing and the latter 17 T e ff values from 4000-8000 K in uni- 
form steps of 250 K. (Together there are 33 unique T c ff values 
because 8000 K is in both.) Together the grids span log</ val- 
ues from —0.5 to 5.0 dex in steps of 0.5 dex, although the grid 
is incomplete for astrophysical reasons (Fig. [5}. [Fe/H] ranges 
from —4.0 dex to +1.0 dex with 13 discrete values for the cooler 
stars (with T e ff < 8000 K): The spacing is 0.25 dex from +1 to 
-1, followed by points at -1.5, -2.0, -3.0 and -4.0. Not all 
[Fe/H] are present at all Tcff-log g combinations. Each star has 
been simulated at one of ten values of the interstellar extinction, 
A v e {0, 0.1, 0.5, 1, 2, 3, 4, 5, 8, 10} with R v = 3.1. (Note that A v 
is the extinction parameter defined by Cardelli et al. 1989 It is not 
the extinction in the V band.) The Marcs library additionally shows 
variation in the alpha element abundances, [a/Fe], representing five 
values from 0.0-0.4 dex in 0.1 dex steps. Hence this combined li- 
brary shows variance in five APs, T e fj, Av, log<? [Fe/H], [a/Fe], 
the first two of which are "strong" and the latter three "weak". The 
total number of spectra is 46 310. ILIUM will be used to estimate 
the first four APs; [a/Fe] will be ignored and contributes cosmic 
scatter. 

GOG simulates the number of photoelectrons ("counts") in the 
spectral bands. (While these will be calibrated in physical flux units 
before being published to the community, the classification work 
by DPAC is currently done in photoelectron space.) The variance 
in the spectra due to the four APs of interest is demonstrated in the 
example spectra plotted in Figs.[6]|7j[8]and[9] The first plot shows 
the variance due to T e g only. The break between the BP and RP 
instruments around 660 nm is clear, as is the highly variable disper- 
sion. The lower counts in RP immediately to the right of the break 
compared to BP is primarily due to the higher dispersion (fewer 
photons per band). In the other plots two APs are varied while the 
other two are held constant. Fig. [^demonstrates the strong impact 
of both T c ff and Ay variations. (As I will discuss in section |6l4| 

4 For Gaia pundits: I use the CU8 cycle 3 simulations of the nominal (dis- 
crete) libraries (Sordo & Vallenari l2"0"08l 
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Figure 6. Noise-free Gaia spectra for solar metallicity dwarfs at zero extinc- 
tion with T eft = {4000, 5000, 6000, 7000, 8500, 10000, 15000} K, increas- 
ing monotonically from bottom (red) to top (violet) at long wavelengths. 
They are composed of two spectra (BP and RP) to the left and right of 
about 660 nm. The ordinate is in units of photoelectrons (to within some 
constant multiple), not energy flux. 
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Figure 7. Noise-free Gaia spectra at a range of T e ff (as in Fig. [6) and Av 
(0.0, 0.1, 0.5, 1, 2, 3, 4, 5, 8, 10) ranging from O.Omag (lowest line at long 
wavelengths; in red) to 10.0 mag (highest line at long wavelengths; in vio- 
let). Each temperature block has been offset by 1200 counts for clarity (the 
zero levels are shown by the dashed lines), log g = 4.0 dex, [Fe/H]= 0.0 dex 
and [a/Fe] = 0.0 dex in all cases. 
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Figure 8. Noise-free Gaia spectra at a range of T e g and logg 
(—0.5,0.5,2,3,4,5), with the lowest gravity (red) forming the lowest 
curve at the red end of the spectrum and the highest gravity (violet) the 
highest. Note that not all gravities are present at all T c ff due to the lim- 
itations of reality. Each temperature block has been offset by 600 counts 
for clarity (the zero levels are shown by the dashed lines) Ay = 0.0 mag, 
[Fe/H]=0.0dex and [a/Fe] =0.0dex in all cases. 



Figure 9. Noise-free Gaia spectra at a range of T en - and [Fe/H] 
(—3, — 2, — 1, 0, +0.5), with the lowest metallicity (red) forming the low- 
est curve at the red end of the spectrum and the highest metallicity (violet) 
the highest. Note that not all metallicities are present at all T c ff due to lim- 
itations of the simulated libraries. Each temperature block has been offset 
by 600 counts for clarity (the zero levels are shown by the dashed lines) 
Ay = 0.0 mag, log g = 4.0 dex and [a/Fe] =0.0 dex in all cases. 



these two APs are highly degenerate in these data.) The third and 
fourth figures clearly demonstrate why log g and [Fe/H] are weak 
parameters: they have little impact on the spectra compared to the 
variance due to T c ff or Ay. In particular, at high temperatures the 
spectra show essentially no sensitivity to [Fe/H]. 

The AP estimation accuracy is, of course, a strong function of 
the SNR in the spectrum. As Gaia has a fixed sky scanning law, 
the SNR depends on the source magnitude and the number of ob- 
servations (because the individual observations are combined into 
a single end-of-mission spectrum). I adopt here 72 observations for 
all spectra (the mean number of observations per source for a 5-year 
mission), and report results as a function of just the magnitude. The 
simulator noise model takes into account the source, background 
and background-subtraction photon (Poisson) noise as well as the 
CCD readout noise. At present errors due to the combination of 
spectra, charge-transfer inefficiency and CCD radiation damage are 
not explicitly included. There is instead a factor to account for gen- 
eral processing and calibration errors. All of these noise terms are 
combined into a zero mean Gaussian model for each pixel, the stan- 
dard deviation of which is a function of the G-band apparent mag- 
nitude. (The G-band is the filterless band - defined by the mirror 



and CCD response - in which the Gaia astrometry is obtained, cov- 
ering a range similar to BP/RP.) This defines a "sigma spectrum" 
for each star from which I generate random numbers in order to 
simulate noisy spectra at G=15, 18.5 and 20. The resulting SNR 
is a strong function of wavelength and the specific source, and is 
summarized in Fig. 1 1 0| 

3.2 Data sets and forward model fitting 

In the following sections I will apply Gaia to four distinct problems 
according to the APs we are trying to determine. In each case the 
forward model is fit to a grid varying in at least those APs, and 
in some cases the grid also shows variance in another AP (which 
therefore acts to provide additional "cosmic scatter"). For each case 
the ILIUM forward model is fit (trained) using the full, noise-free 
data set over the complete range of the represented APs. The cor- 
responding noisy data set (at G=15, 18.5 or 20) is split randomly 
into two equal halves: one half is used for initialization (selecting 
the nearest neighbour) and ILIUM is applied to the other half (or a 
random subset of it where indicated below) on which the perfor- 
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Figure 10. The median signal-to-noise ratio (SNR) (solid line) and 0.1 and 
0.9 quartiles (dashed lines) across the set of zero extinction dwarf stars for 
G=18.5 (red/thicker lines) and G=20.0 (blue/thinner lines). Compared to 
the G=18.5 curve, the SNR at G=15 is 8-6 times larger between 400 and 
660 nm and 7-12 times larger between 660 and 1000 nm. 



mance is evaluated^] (The legitimacy of this procedure for evaluat- 
ing performance is discussed in appendix[A]) In addition to report- 
ing global results (over the full AP ranges present in the training 
set) I also measure performance on subsets of the evaluation data 
set, in particular the [Fe/H] just for cool stars (the ILIUM forward 
model is not refit). The problems and their grids are as follows (the 
name indicates the APs being determined: Temperature, Gravity, 
Metallicity and/or Extinction) 

• TG: Estimation of Tcff+logg (1 strong and 1 weak AP), for 
stars with Av = and [Fe/H] = 0, some 274 stars. I also build a 
second TG model (TG-allmet) which is trained and evaluated with 
the full [Fe/H] range in the grid (—4 to +1 dex). This contains 4361 
stars, of which a quarter are used in the evaluation set. 

• TM: Estimation of T c h + [Fe/H] (1 strong and 1 weak AP), for 
stars with A v =0 and either loggG {4.0,4.5,5.0} (TM-dwarfs; 
1716 stars), orlogp G {1.0,1.5,2.0,2.5,3.0}, (TM-giants; 1882 
stars). I also build a third model (TM-allgrav) which is trained and 
evaluated with the full logy range in the grid (—0.5 to +5 dex). 
This contains 4361 stars (it's the same grid as used for TG-allmet 
of course), of which a quarter are used in the evaluation set. 

• TAG: Estimation of (T e fi, Ay)+logg (2 strong and 1 weak 
AP), for stars with [Fe/H] = 0. This has 2740 stars, of which a ran- 
dom selection of 1000 is used for evaluation. 

• TAM: Estimation of (T e ff, Av)+[Fe/H] (2 strong and 1 weak 
AP), for dwarfs with log p G {4.0, 4.5, 5.0}. This has 17 160 stars 
of which a random selection of 1000 is used for evaluation. 

• TGM: Estimation of T e a+(logg, [Fe/H]) (1 strong and 2 



5 These target spectra must first be normalized to have the same counts 
level as those on which ILIUM was trained. For this I just use the G mag- 
nitude to scale the counts. However, as the G-band is not identical to the 
BP/RP band adopted, this gives rise to a normalization offset between spec- 
tra even for a common G magnitude. For example, over the TG grid the in- 
tegrated BP/RP counts varies by up to 10%, dependent primarily on T B ff . A 
better normalization might be "area normalization", i.e. dividing the counts 
in each band by the sum over all bands for that spectrum. My G-band nor- 
malization is in principle conservative, as it mimics a small calibration bias. 



weak APs, for stars with Av =0. This has 4361 stars of which a 
random selection of 1000 is used for evaluation. 

In appendix|B]the ILIUM results are compared with results from an 
SVM on some of these problems. 



4 APPLICATION TO THE T off +log g PROBLEM (TG) 

First the forward model is fit for each band to the T c ff-log g grid 
shown in Fig. [5] As a reminder, each forward model comprises the 
ID function over T e g (the strong component) and 33 ID functions 
in log g (the weak components), as described in section [2T4] All of 
these are smoothing splines (section [2.5.2[ l. Figs. |1 l|and|12| show 
the forward model fit for 12 bands at cuts of constant log g and T e g 
respectively. The fits are good: they show the degree of smoothness 
we would expect for these data plus a robust extrapolation. If we 
compare the flux scales between Figs. 1 1 1 1 and 1 12| (these are stan- 
dardized variables) we see how small the flux variation is as log g 
varies over its full range compared to T e g: this is what it means 
to be a weak AP. Note also the small discontinuity in some of the 
bands in Fig.[TT]at 8000 K (log(T eff ) = 3.903) where the Marcs and 
Basel libraries join. 

Having trained ILIUM I apply it to the evaluation data set at 
G=15. Fig.[T3]shows five examples of how the AP estimates evolve. 
The first iteration is the nearest neighbour initialization; the final 
is the adopted AP estimate. The red (horizontal) lines show the 
true APs. Looking at many examples we see a range of conver- 
gence behaviours. Sometimes convergence is rapid, for other stars 
it takes longer. Sometimes it is smooth, other times not. It can be 
quite different for the two APs for a given star and depends also 
on the specific spectrum (which is noisy). Sometimes the nearest 
neighbour estimate is the correct one, and ILIUM may actually it- 
erate away from this and converge on a different value. Conver- 
gence (on something) is almost always achieved on this problem, 
even though there is nothing adaptive in the algorithm. This is an 
encouraging property. Limit cycles are also seen in a handful of 
cases, but with negligible amplitudes on this problem (high SNR). 
For this specific problem, the extrapolation limits on the AP esti- 
mations (sec tion | 2.5.7fr and the limits on the AP updates at each 
step (section [2.5.6 1 never had to be enforced by the algorithm. 

Fig. [14] plots the ILIUM residuals (estimated minus true APs) 
on the 137 stars in the evaluation set; the statistics are summarized 
in line 1 of Table|2] We see that the APs can be estimated very accu- 
rately (no significant systematic error, e sys ) and very precisely (low 
scatter, e mac or e rma ). Applying the same model to noisier spectra 
obviously increases the errors (lines 2 and 3 in the table), but at 
G=18.5 the mean absolute errors are still an acceptable 1% in T e g 
and 0.35 dex in logg. At G=20 log g cannot be estimated accurately 
enough to reliably distinguish dwarfs and giants, although T e ff is 
still okay with an expected error of 260 K at 6000 K, for example. 

This evaluation set is relatively small so the error statistics are 
subject to variation. Applying ILIUM to an ensemble of randomly 
selected evaluation sets (at G=18.5), we see that the error statistics 
vary by 5-10% (inter-quartile range) of their mean. The reported 
values are therefore reasonably representative. 

The performance of an SVM on these TG problems is given in 
Table |B 1 1 For both APs and all three magnitudes ILIUM is similar to 
or significantly better than SVM. (As there is some variance in the 
results from both methods I only consider the performance signif- 
icantly different if the better one has a e mac at least 25% smaller.) 
So in this limited variance problem, at least, the forward modelling 
approach improves performance. 
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Table 2. Summary statistics (defined in section |2~B) of the ILIUM performance on various experiments. The first column indicates the model and grid used 
(defined in section [3~2) . The second column defines the magnitude and composition of the evaluation data set: F means the full AP ranges (as present in the 
training set); L means only stars with T e g < 7000 K. "<" in the e syB column indicates that the systematic error has a magnitude less than twice the standard 
error in the mean (|e sy8 | < 2 X 1.25 £ma.e/V~N), i.e. is statistically insignificant. (Quite a few not so marked are only marginally significant.) The units of the 
error statistics are logarithmic for all variables (dex for all except Av which is in magnitudes) and are on the proper variables (not the standardized variables). 
Multiplying by 2.3 converts to fractional errors for log(T e ff ). 
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Figure 11. Predictions of the full forward model for the TG problem as a function of log(T e ff ) with log g fixed at 4.0 dex for 12 different bands (the central 
wavelength of which in indicated at the top of each panel in nm). The black crosses are the (noise-free) grid points, the small red stars are the forward model 
predictions (at randomly selected AP values) and the blue circles the noisy (G=15) grid points. The photon counts plotted on the ordinate are in standardized 
units. 
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Figure 12. As Fig.[TT] but now showing predictions of the full forward model as a function of log g at constant T c g =5000 K. 



It is hard to fairly compare the performance with nearest 
neighbours (NN), because on the one hand NN is limited by the 
density of its template grid, but on the other hand it can report the 
exact AP values. If we used a full, noise-free template grid and 
noisy evaluation objects, then provided the noise is low enough, 
1-NN will give exact results. If we instead split the template and 
evaluation sets to have no common objects, then the precision of at 
least one of the APs is limited by the grid spacing. We could instead 
average over the k nearest neighbours, but then NN does badly be- 
cause it averages over a wide range of the weak AP (a shortcoming 
which party motivates ILIUM). To give some comparison, however, 
I estimate the parameters of all 274 stars in the TG grid of noise- 
free spectra via leave-one-out cross validation. The errors in both 
APs are six times larger than the ILIUM result at G=15. This at 
least confirms that ILIUM overcomes the grid resolution limitation 
of NN. 

The above results are for stars with [Fe/H] = dex. I retrained 
and evaluated ILIUM on a grid with the full range of metallicities 
(—4 to 1.0 dex; TG-allmet). The errors averaged over this full data 
set at G=15, reported in line 4 of Table [2] are 6-8 times larger than 
the solar metallicity case. This is entirely due to the new metallicity 
variance which is not accounted for (modelled) by ILIUM and so 
is a confusing factor. (A T e ft accuracy of 1% at G=15 and 2% at 
G=18.5 is nonethelss good for stars which a priori show variance 
over the full range of T e g, log g and [Fe/H]). Curiously, if we apply 
the model to G=20 data, then we see that the performance is no 
worse than when we limited the problem to solar metallicity stars 
(compare lines 3 and 6 of Table|2]l. This is because the photometric 
noise dominates over the variance introduced by the (unmodelled) 
metallicity range. 

If the AP residuals had a Gaussian distribution, then 
e rms = 1.25e m ac in Table [2] The fact that e rms is always larger 



(sometimes much larger), indicates that there are outliers, which 
justifies the use of e mae as a more robust error statistic. 

In terms of overall error, an SVM does better than ILIUM on 
the TG-allmet problem at all three magnitudes (Table |BTJ. It's not 
clear why this is so, given that ILIUM was better on the problem 
limited to solar metallicity (see sectionlBl. 



5 APPLICATION TO THE T oft +[Fe/H] PROBLEM (TM) 

I now use ILIUM to estimate T e g and [Fe/H] on the two grids TM- 
dwarfs and TM-giants defined in section [3~2] In each case we effec- 
tively assume we already have a rough log g estimate. The remain- 
ing spread in log g in each case acts as cosmic scatter. The models 
are then applied to the corresponding evaluation sets with noise 
levels at three magnitudes. The summary performance statistics are 
show in six lines in Table [2] As there is essentially no sensitivity 
to metallicity at T e g above 7000 K (Fig. [9}, I only report results on 
cooler stars, even though the ILIUM models were fit to the full T e g 
range. 

Looking first at the T e g performance, we see that the precision 
is twice as good as TG at all magnitudes (e.g. 0.007 dex compared 
to 0.019 dex). This just indicates that we can estimate T e a more 
precisely for cool stars. Within the T e g range 4000-7000 K there 
is no strong dependence of the T e ff precision with T e g or [Fe/H]: 
ILIUM can estimate T e g equally well at all metallicities. 

Turning now to metallicity, we see good performance at G=15 
and G=18.5 for dwarfs and giants: random errors of 0.3 dex or less 
and negligible systematics. At G=20 the performance is quite a 
lot worse (0.7-0.8 dex). There is little dependence of [Fe/H] pre- 
cision or accuracy with [Fe/H], as can be seen in the lower panel 
of Fig. [15] Even for the most metal poor stars in the sample at 
[Fe/H] = —4.0 dex the precision is still 0.5 dex. This plot also shows 
that the AP estimates hardly ever exceed the limits of the training 
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Figure 13. AP evolution for 5 stars in the evaluation data set at G=15 (log g 
left, T c ff right) for the TG problem. The true APs are written at the top of 
each panel pair and plotted as the red horizontal line. GoF is the reduced \ 2 
goodness-of-fit (equation |12| . 



grid, even though they are allowed to (section [2.5.7[ >, again suggest- 
ing natural convergence properties of the algorithm. 

We gain some insight into how ILIUM works if we include hot 
stars in the evaluation set. There is now a strong systematic error in 
the [Fe/H] estimates, as can seen in the upper panel of Fig.[T5] The 
reason is that the sensitivity of all the spectral bands to metallicity 
is essentially zero in hot stars. In that case there is no contribu- 
tion from metallicity to the flux updates in the ILIUM algorithm, 
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Figure 14. AP residuals for the TG problem at G=15 
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Figure IS. [Fe/H] residuals for the TM-dwarfs model at G=15 for the full 
T c fj range (top) and for cool stars only (T e g < 7000 K; bottom). Note the 
different scales on the ordinate. The systematic error (due to the lack of 
sensitivity to [Fe/H] for hot star spectra) vanishes in the lower panel. 



so the flux prediction is entirely from the strong component of the 
forward model. That predicts a flux corresponding to the average 
value of the metallicity (Fig. [4) in the training grid, so this is the AP 
value which ILIUM reports. This is obviously higher than the low- 
est metallicities, with the result that ILIUM overestimates [Fe/H]. 
This is logical and desirable: ILIUM reports the average value in 
the training data when the spectrum provides no information. The 
AP distribution is acting as a prior. 

The above results have implicitly assumed the log g of the star 
to be known well enough to identify it as a dwarf or giant. We saw 
from the results in the previous section that ILIUM itself can do this 
(even if the metallicity is not known: line 4 of Table[2j. But what if 
we relaxed this assumption? To test this, I trained and evaluated a 
new model, TM-allgrav, on the full range of log g (see section [3^2| l. 
Even at G=18.5 this model can estimate [Fe/H] to a precision of 
0.4 dex, which would be enough to trace the metallicity distribution 
in the Galaxy and identify very metal-poor stars. 
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Figure 16. Predicted AP uncertainties (top) and actual residuals (bot- 
tom) for log(T c ff ) from the TM-dwarf problem at G=20. The area of the 
plotted circle (not the diameter) is proportional to the size of the uncer- 
tainty/residual. In the lower panel, positive residuals are shown in black and 
negative in red. Points below some size limit are not plotted at all. 



Section |277] gave a formula for the covariance of the APs esti- 
mated by ILIUM. For this TM problem, C<^ is a 2 x 2 matrix with 
elements c,y. The corresponding uncertainty estimates, y/cj], are 
plotted as a function of the two APs in the upper panel of Fig.|16| 
This can be compared to the actual errors (residuals) in the lower 
panel. The fact that they broadly agree indicates that the uncertainty 
estimation is useful. This is important, because for many purposes 
knowing the uncertainty in an estimate is as important as the esti- 
mate itself. 



6 APPLICATION TO 3-AP PROBLEMS (TAG & TAM) 
6.1 Extension to higher dimensions 

So far the forward model has comprised two ID components. For 
more than two APs the forward modelling approach described in 
section |2T4| generalizes almost trivially. The principle is to retain the 
partition of the APs into the two categories "strong" and "weak". 
With Ns strong APs, the strong component of the forward model is 
a single N s -dimensional function, fit at the n s unique combinations 
of the strong APs in the training grid. At any one of these points, k, 
there are n w (k) grid points which vary over the N w weak APs. The 
mean flux over these is used to fit the strong component. The weak 
component at point k is then built by making an N w -dimensional 
fit to the flux residuals (with respect to the strong component) over 
the n m (k) points at k. Thus we have n s independent weak compo- 



nents. The components are combined and applied as in the 1D+1D 
case. Nothing else in the ILIUM algorithm is changed. 

Here I apply ILIUM to two different problems both with two 
strong APs and one weak AP. This is especially important for stellar 
parametrization (and Gaia) because in practice we have to accom- 
modate variable interstellar extinction, which can be large at low 
Galactic latitudes and has at least a large an impact as T c fj on the 
spectrum (see Fig.|7J. 

I fit the 2D strong component of the forward model using a 
thin plate spline, a type of smoothing spline closely related to krig- 
ing (e.g. Hastie et al. 2001 1. The number of degrees of freedom is 
set to n s /2 = 330/2 = 155 in both problems. The ID weak com- 
ponents are again fit using smoothing splines with the dof set as 
before (section |2.5.2fr . 



6.2 Results for (T off +A v )+log g (TAG) 

In this problem the forward model is fit over the full range of T fj, 
Ay and logg for solar metallicity stars (TAG in section |3~2} , The 
ID cut through the fit in Fig. |17| shows that the newly introduced 
Ay variance is fit accurately (as are the variations in the other APs). 
The summary performance when applying ILIUM with this model 
to G=15 spectra is shown near the bottom of Table|2] Compared to 
the results on the TG data set at the same magnitude, the errors in 
log(T c ff) and log g are, at 0.013 dex (3%) and 0.3 dex respectively, 
considerably worse. This is due to the extra variance introduced by 
the very wide range of extinction. Yet these errors are still small 
enough to be scientifically useful, especially when we note that the 
systematics are quite small. The new AP, extinction, can be esti- 
mated very well: a mean absolute error of just 0.07 mag. If we limit 
the analysis (without changing the fitted model) to just low extinc- 
tion stars (Ay < 1.0 mag), then Ay can be estimated only slightly 
better (0.056 mag) but log(T e ff ) more so (0.008 dex). 

At G=18.5, the errors of course increase (Table[2|: while T c ff 
and Ay are still manageable, at 1.1 dex logy is not. Fortunately 
Gaia will provide parallaxes for many stars to improve the log g 
estimates (given the T c ff and Ay estimates from ILIUM). 

At G=15 the SVM has comparable performance on this prob- 
lem, but is somewhat better at G=18.5 (see section[B]l. 



6.3 Results for (T eff +A v )+[Fe/H] (TAM) 

We now swap log g for [Fe/H] and train and evaluate ILIUM on the 
TAM problem (which is a dwarf sample: the results are broadly 
similar for a giant sample). The summary performance at G=15 is 
listed near the bottom of Table|2] The performance is degraded sig- 
nificantly compared to the case with no Ay variance (TM-dwarfs), 
much more so for T c g than for [Fe/H]. At 0.46 dex the metallic- 
ity error is acceptable. Alas at G=18.5 this degrades to an almost 
useless level with the additional problem of a large systematic]^] 

If we limit the analysis (at G=15) to just low extinction stars 
(Ay < 1.0 mag), then the [Fe/H] precision is essentially unchanged 
(0.44 dex). This implies that it is no more difficult (on average) to 
determine the metallicity of stars with high extinction than of stars 
with low extinction. So why is the error larger here than in the zero 
extinction case reported for the TM problem? It is because we now 



6 Contrary to expectations, systematic trends can rarely be corrected for. 
We would have to plot the residual vs. the estimated AP, and if the scatter is 
larger than the systematic trend then a correction cannot be made. 
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Figure 17. Predictions (small red stars) of the full forward model for the TAG problem as a function of Av with T e g fixed at 10 000 K and log g at 4.0 dex 
for 12 different bands (the central wavelength of which in indicated at the top of each panel in nm). The black crosses are the (noise-free) grid points. The flux 
plotted on the ordinate is in standardized units. 
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Figure 18. The trend in the mean absolute errors for log(T c ff ) and Ay in 
the TAM problem on the G=18.5 evaluation data set (which spans the full 
range of T c g , Ay and [Fe/H]) 



have a large extinction range a priori, so an uncertainty in determin- 
ing Av corresponds to an uncertainty in [Fe/H] . As [Fe/H] is weak, 
a relatively small uncertainty in Ay is magnified into a larger one 
in [Fe/H]. 

Note that the extinction errors are larger than those reported 
with the TAG problem (0.18 mag compared to 0.07 mag at G=15). 
This is a consequence of having limited the analysis to cool stars. 
Indeed, the mean absolute error on the present problem is only 
0.1 mag if we analyse the hot stars (T e g > 7000 K) rather than the 
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Figure 19. The correlation between the Ay and log(T fj ) residuals for the 
TAM model applied to the G=18.5 evaluation data set 



cool ones: Av can be estimated more accurately for hotter stars. 
This is not surprising because hot star spectra are simpler (e.g. 
no metallicity signature) so it should be easier to untangle the ef- 
fects of temperature and extinction. This is illustrated in Fig. 1 1 8|at 
G=18.5, which plots the dependence of the log(T c ff ) and Av resid- 
uals with these parameters (averaged over all [Fe/H]). It also shows 
that T c ff is estimated more precisely for cool rather than hot stars, 
as found also for the TM problem. 

Nonetheless, the T e ff and Av errors are quite large at G=18.5: 
8-20% for T cff and 0.2-0.6 mag in A v (Fig.[l8j. Moreover, their 
residuals show a strong positive correlation (Fig. \19\: a tendency 
to overestimate one results in an overestimation of the other. This 
suggests that these two APs are degenerate in these low resolution 
BP/RP spectra, something which Fig. [7] also suggests. Let us now 
investigate this further. 
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6.4 Identification of a T cff -Av degeneracy 

Like most estimation algorithms, ILIUM simply tries to find the best 
single solution (plus an uncertainty estimate). If we consider a like- 
lihood function of the data given the APs, P(Data]</>), then the 
algorithm is trying to find the maximum of this, plus some measure 
of the width of this peak. But if there is a degeneracy in the APs 
then the peak position and width are an inadequate summary of the 
likelihood. A degeneracy can be defined as two or more stars which 
have spectra differing by an amount consistent with the photomet- 
ric noise. On observing one of these spectra we would be unable to 
distinguish between the AP solutions. This applies trivially to stars 
with very small AP differences (thus the need to quote an uncer- 
tainty). What we are interested in here are objects which have sig- 
nificantly different APs (a large fraction of the grid range), such that 
a simple approximation of the likelihood function (e.g. a Gaussian 
in the APs with covariance given by equation[TTJ fails to represent 
the uncertainties accurately. 

As ILIUM is a local search method, it is a priori possible that 
when initialized at different points it could converge on different 
(degenerate) solutions. To explore this I ran ILIUM 25 times for 
each star with the initialization chosen at random each time, rather 
than by the nearest neighbour. (I used the TAG model with G=18.5 
spectra.) Fig. [20] plots the various solutions in the T e ff-Av plane 
for three different stars, chosen to illustrate degeneracy. 

For the first star (top panel), we see that 22 of 25 initializa- 
tions over quite a range of T e g and Av converge on a small re- 
gion (slightly offset from the true solution possibly due to noise 
in the spectrum). The three "wrong" solutions occur because the 
algorithm is initialized too far from a good solution: the searching 
method gets stuck in a poor local minimum. Inspection of their pre- 
dicted spectra shows them to be very different from the true spec- 
trum, with GoF values (equation \12) of more than a few hundred. 
Such a high GoF would be used in practice to flag and reject such 
solutions. 

Turning to the second star (middle panel) we see two distinct 
regions of convergence, one of which is close to the true APs. Are 
the other solutions at around T e fi = 13 000 K simply poor (bad con- 
vergence)? When we inspect the predicted spectra (not shown), we 
see this is not the case. The two sets of spectra are indistinguishable 
with the noise, all having similar and small GoF values (0.01-15). 

This is better illustrated by the third star (bottom panel), where 
we see a lack of convergence on one or two solutions in favour 
of a complete ridge of solutions. Fig. |2T] plots the initial and fi- 
nal (predicted) spectra for these 25 runs of ILIUM. All but one or 
two of these final spectra agree very closely with the true spectrum, 
even though they have very different APs. So this is a true degen- 
eracy, and not poor convergence of ILIUM. This degeneracy is also 
reflected by a high value for the expected correlation coefficient 
between T e g and Av calculated by equation Identification of 
similar ridges in AP space of near-identical spectra for other stars 
suggests that the T e ff-A v degeneracy is widespread. 
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Figure 20. Multiple random initialization of ILIUM for three different stars 
(in the three panels). In each panel the starting points are shown as green 
dots connected to the solutions (after 20 iterations) shown as blue triangles. 
The true APs are shown as a red cross. 



and any noise-free spectrum, p', in this grid is, with Sp — p — p' , 
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6.5 Mapping of the T e rf-Av degeneracy 

We can map this degeneracy systematically using just the forward 
model to generate spectra on a fine grid of T e g and Av- For each 
predicted spectrum I adopt as its sigma spectrum that of the star in 
the original grid which has the closest APs. The expected Maha- 
lanobis distance between a hypothetically measured spectrum, p, 



the simplification following because in these simulations there is 
no inter-pixel noise correlation. A degeneracy arises between two 
stars when D 2 is sufficiently small that it could arise just from pho- 
tometric noise. Under the null hypothesis {Ho) that the differences 
between the spectra are only due to Gaussian noise and that each 
pixel is independent, D 2 follows a \ 2 distribution with J — 1 = 67 
degrees of freedom. I will define two stars as non-degenerate only if 
the probability of observing their given distance or more under Ho 
is 1% or less. This corresponds to a critical value of D 2 ira = 96.8 
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Figure 21. The initial (green dashed-dot line), final (blue thick line) and true (red thin line) spectra for the 25 runs of ILIUM for the one star shown in the 
bottom panel of Fig. |2()| plotted vs. band number (BP and RP are separated by the vertical dashed line). The T e g and Ay of the ILIUM solutions (the final 
spectra) are plotted at the top of each panel along with the GoF of each. The true APs are 5500 K and Av =5.0 dex 



(or a reduced-^ value of 1.44). In other words, stars separated by 
a smaller distance are considered degenerate]^] 

For each star in the fine grid I calculate the distance to all of the 
other stars. The corresponding probabilities (of getting that distance 
or more) can then be plotted as a degeneracy map in the T eff -Av 
plane. Maps for 15 stars are shown in Fig. [22] (similar patterns are 
observed throughout the grid). We see that the APs are correlated, 
the high probabilty region extends over a large range of the grid 
and in some cases is multimodal. In other words, there is a strong 
degeneracy over the whole plane. What this means in practice is 
that if we observed a (noisy) spectrum, p, at the position of one of 
the crosses in Fig. [22] then this spectrum is indistinguishable from 
all the spectra lying within the contours, on account of the noise. 



X is the distribution followed by a sum of squares of independent unit 
Gaussian variables, N(Q, 1), and has a (unnormalized) density function 
P'(D 2 ) = D v ~ 2 e~ D ' ''I 2 where v is the degrees-of-freedom. v = I— 1 
because one degree-of-freedom is "lost" by the fact that they have a com- 
mon G magnitude, although this is of no practical significance he re. /jH 
is defined by P(D 2 > Dgjff ) = J^ 2 P'dD 2 = 0.01. Fig.|22jplots 
this probability (and several others) as a contour. 



Therefore we also cannot distinguish between the corresponding 
APs. 

We can compare the degeneracy maps with the expected co- 
variance for individual objects calculated from equation[TT]by treat- 
ing the latter as the covariance in a Gaussian likelihood function in 
the APs. While the estimated correlations agree qualitatively with 
what we observe in Fig.|22] this Gaussian likelihood approximation 
does not accurately reproduce the shape of the degeneracy ridges. 
Indeed it cannot, because the degeneracy ridges are not symmetric 
about the true estimate, and Av cannot be negative. This lack of 
detailed agreement is not very surprising, however, because equa- 
tion [TT] is a valid approximation only when the linear relation in 
equation[T]holds, i.e. when the errors are small. 

The implication of this is significant: It is misleading to report 
a single estimate for T e g and Av (even if accompanied by a sim- 
ple covariance estimate). Rather, we must report a whole ridge of 
solutions. As we can map these degeneracies in advance, the pur- 
pose of a classifier such as ILIUM would be to identify a solution, 
which we use to identify the corresponding degeneracy ridge (ei- 
ther as a probability grid or a fitted approximation to it). This study 
shows that a single nearest neighbour initialization of ILIUM is usu- 
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Figure 22. Degeneracy map (a likelihood function) for the forward model of the TAG grid at G=18.5. For each of the 15 stars (panels) marked as a red cross, 
each contour indicates those stars which have a probability P of having a spectrum indistinguishable to within the expected photometric noise. Contours are 
plotted for logP = {—4, —3, —2,-1,0} (due to the steep sides of the \ 2 function, these are barely distinguishable). 



ally adequate to allow ILIUM to find a good solution and therefore 
the correct ridge. The minority of cases where no good solution 
is found are usually identifiable by the large GoF value. In those 
cases we may want to re-initialize ILIUM from some other point or 
improve the convergence mechanism. 

Note that Fig.[22]is for G=18.5. At G=20 the degeneracy re- 
gions are somewhat larger. At G=15 the SNR is sufficiently high 
that there is no longer any degeneracy on a scale of practical rele- 
vance: the contours in the degeneracy map are smaller than the grid 
point separation (0.2 mag in Av and 0.01 dex in log(T c ft)). This is 
consistent with the small errors in these APs recorded in Table [2] 



6.6 Including additional or prior information 

Additional information could help to reduce the extent of the de- 
generacy. 

The parallax reported by Gaia combined with the apparent G- 
band magnitude gives an estimate of Mg+Ag, where Mg is the 
absolute magnitude in the G band and Ag is the extinction in the 
G-band (which we could estimate directly from ILIUM instead of 
Ay). Combined with an HR-diagram (which is just a prior proba- 
bility density function over T e ff and Mg) this provides additional 
constraints on T c ff and Ag • A method for doing this is outlined in 
Bailer-Jones ( 12010b . 

We could set weak prior probabilities on the APs - the proba- 
bility of the APs unconditional on the BP/RP spectrum, parallax or 
G-band magnitude - based on a simple model of the Galaxy. For 
example, we know that cool stars are much more common than hot 



stars, so an unreddened cool star is a priori more likely that a highly 
extinct OB star. Likewise, if the star is observed at high Galactic lat- 
itude we could confidently assign a lower prior probability to high 
extinctions. This latter prior is particularly useful, because we saw 
in sections [4] and [5] that if the extinction can be fixed then ILIUM 
gives much better estimates. Normally, however, we wouldn't want 
to bias the Gaia AP estimates with detailed, current knowledge of 
Galactic structure, given that the primary goal of Gaia is to improve 
this knowledge. 

For very bright stars (G £ 12; Gaia will saturate at around 
G=6) we may have AP estimates coming from the high-resolution 
RVS spectrograph on Gaia measuring the Call triplet at 860 nm 
(Katz et al. 2004). In principle we can estimate T e g independently 
of Av using these data. Note that because the ridges in Fig.|22|have 
a low inclination, additional information on T c ff is more useful than 
information on Av. 

Finally, estimates of APs from other surveys could be incor- 
porated and combined with the Gaia ones, provided they can be 
interpreted as probability distributions. 

Given the opportunity, one should design an ad hoc spec- 
trophotometric system to address the specific goals of the survey. 
Particularly desirable is a system which maximizes the separabil- 
ity of the APs (a method for doing this was presented by Bailer- 
Jones ,2004 1. This opportunity was taken by the Gaia community, 
which designed a multiband photometric system to address the spe- 
cific Gaia scientific goals (Jordi et al. 2006 1. But this was later re- 
placed with the present (and less desirable) BP/RP spectrophotom- 
etry, ultimately for financial reasons. 
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7 FURTHER APPLICATIONS (AND THE TGM 
PROBLEM) 

In the previous section the forward model was extended to be 2D 
in the strong APs and ID in the weak. In a similar way it can be 
adapted to model one strong AP (e.g. T e g) and two weak APs (e.g. 
[Fe/H] and log g) simultaneously. This is of practical use in situa- 
tions where extinction can be assumed to be low. I used this to fit 
a forward model to the TGM grid (section [3~2] l and applied ILIUM 
as before. The results are shown in the bottom section of Table [2] 
At G=15, the performance in T e g and [Fe/H] is considerably better 
than obtaind with the corresponding TM problems in which there 
was unmodelled log g variation: we now obtain 0.3% in T G ff for the 
full T c ff range and 0.1 dex in [Fe/H] for cool stars, log g is also bet- 
ter than found with TG-allmet, but not with TG where metallicity 
was fixed to zero. This as we would expect: at high SNR, modelling 
the extra variance improves the results in all APs. 

This is not the case at G=18.5. Here the performance of the 
TGM model on all three parameters is similar to that obtained in the 
TG-allmet or TG-allgrav problems on the appropriate T e ff ranges. 
I suspect that the extra variance introduced by the noise is compli- 
cating the problem, so that the modelling the additional AP (rather 
than marginalizing over it) brings no improvement in performance 
(but the obvious advantage that we now determine all three APs in- 
stead of just two.) It is also worth pointing out that when using the 
degeneracy mapping method described in section [63) I find there 
is a very strong and complex degeneracy between [Fe/H] and log g 
atG=18.5. 

ILIUM could be further extended to more or other APs or to 
different problems, provided the partition between strong and weak 
APs can be retained. If such a clear distinction were not possible 
then a different approach to forward modelling would be necessary. 
If we had more than two or three weak or strong APs, then the 
smoothing splines are unlikely to provide adequate fits, unless we 
had a lot of data. To overcome this we would need more structured 
regression models suited for fitting high dimensional sparse data 
sets (e.g. neural networks). 



8 SUMMARY AND CONCLUSIONS 

I have introduced an algorithm for estimating parameters from 
multi-dimensional data. It uses a forward model of the data to effec- 
tively perform a nonlinear interpolation of a template grid via the 
Newton-Raphson method. It is intended to overcome both the non- 
uniqueness issue of direct modelling of inverse problems as well 
as the finite grid density and metric definition issues of k nearest 
neighbours. ILIUM makes use of the sensitivity of the data to the 
astrophysical parameters to find an optimal solution. This is con- 
venient, because in principle it means we don't need to worry too 
much about feature selection: low sensitivity features will automat- 
ically be down weighted. An important component of the algorithm 
is the division of the forward model into two parts which indepen- 
dently model the variance of the "strong" APs (such as T e ff and 
Ay) and the weak APs (such as log g and [Fe/H]). Good fits could 
be obtained using low-dimensional (1 or 2) smoothing splines. As 
it is based on a forward model, ILIUM naturally provides AP uncer- 
tainty estimates (actually full covariances) and a goodness-of-fit, in 
contrast to most inverse modelling methods. 

I applied ILIUM to the problem of estimating APs from sim- 
ulations of the low resolution Gaia spectrophotometry, BP/RP. Re- 
sults are summarized in Table [2] When limited to zero extinction 



stars, T e g can be estimated at G=15 to a (mean absolute) accuracy 
of better than 1% when the metallicity and surface gravity are en- 
tirely unknown (log g =—0.5 to 5.0 dex; [Fe/H] = —4 to +1.0 dex). 
At G=18.5 and G=20.0 (the limiting Gaia magnitude) the average 
T e ff accuracy over the full range of APs is 2% and 4% respectively. 
At G=15 logg can be estimated to 0.5 dex if [Fe/H] is entirely 
unknown but to 0.15 dex if [Fe/H] is also modelled (full range of 
the three APs). Limiting to solar metallicity stars this improves to 
better than 0.1 dex and is still 0.35 dex at G=18.5. [Fe/H] for cool 
stars can be estimated to 0.1 dex at G=15 and to 0.15-0.35 dex at 
G=18.5, the better result obtained if we know the star is a dwarf. 
At G=20 log g and [Fe/H] cannot be estimated to any useful accu- 
racy. Of course, for population studies we can average over many 
stars to reduce the population estimate of metallicity, limited by the 
systematic errors and any correlations in the data. 

If we extend the strong component of the forward model to 
simultaneously model Av over a very wide range (0-10 mag), then 
the performance on the weak APs is significantly degraded on ac- 
count of this extra variance, such that reasonable accuracies can 
be reached at G=15 but not at G=18.5. Av itself can be estimated 
remarkably well, however, 0.07-0.2 mag at G=15 and 0.3-0.5 mag 
at G=18.5. The extra variance also affects the T e g determination. 
However, these statistical errors do not tell the whole story: I have 
shown that there is a strong and ubiquitous degeneracy between 
T e ff and Av intrinsic to the BP/RP spectra. Thus in addition to re- 
porting single optimal estimates, the Gaia catalogue will need to 
provide the corresponding degeneracy map, which can be built in 
advance using the forward model. The degeneracies could be re- 
duced (and the weak APs also then estimated more accurately) if 
we can use additional information. Possible sources include the ap- 
parent magnitude and parallax measured by Gaia, external data, 
and/or weak priors from a very simple Galaxy and stellar evolution 
model. 

For some problems ILIUM outperformed a support vector ma- 
chine in terms of smaller residuals, but in other cases the SVM was 
better. This should be assessed in more detail after improvements 
to the ILIUM algorithm have been explored. For example, the up- 
dating method is very simple - stopping after a fixed number of 
iterations - whereas a more adaptive approach may help on larger 
variance problems (e.g. noisier data). We may also want the AP 
update weighting to take into account the noise (and not just the 
sensitivity). 

Now that we have a forward model, further possibilities for 
modelling open up. ILIUM is just a method for locating the best 
APs. If we define a distance metric (such as equation |13} , then 
we can define a likelihood function, P(D\<f>), which when com- 
bined with a suitable prior defines a (non-analytic) posterior over 
the APs, P(4>\D). We can then use one of many sampling methods, 
such as Markov Chain Monte Carlo, to sample this as a function of 
<f>, thus yielding a complete Bayesian probabilistic solution which 
varies smoothly over the APs. This approach to parameter estima- 
tion has been used in several areas of astronomy, such as galaxy 
classification (e.g. Heavens et al. 2000) and inference of cosmo- 
logical parameters (e.g. Percival et al. 2007 1. (If we dispensed with 
the forward model and just calculated probabilities at points in the 
original grid, then the principle is similar to that used by Shkedy et 
al. 120071 ) While offering some advantages, this approach is much 
slower than ILIUM, because for the present application it would re- 
quire of order 10 4 to 10 5 samples and hence this many evaluations 
of the forward model, compared to about 10 2 for ILIUM. 
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APPENDIX A: ASSESSING MODEL PERFORMANCE 

The ILIUM forward model should obviously be fit on the full avail- 
able grid and using noise-free data. (We need to achieve a min- 
imum grid density in order to reliably assume that the modelled 
signal vary smoothly between the grid points, yet the grid need 
be no denser. Some smoothness assumption must be made by any 
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method of estimating continuous parameters.) Yet at first sight one 
may consider it illegitimate to fit ILIUM on the full grid and then 
assess its performance on (noisy) spectra selected from the same 
grid. (Note that the initializaton set is never included in the evalu- 
ation set!) The equivalent approach is generally considered invalid 
for inverse models, because these are fit by minimizing the AP error 
on a training set: If they are not properly regularized the model will 
overfit the training data by learning non-general aspects of these 
data (such as the noise). But the situation with ILIUM is differ- 
ent, because the best forward model fit is identified (interactively 
at present) according to its smoothness and the photometry error, 
not the AP error. That is, the regularization is also done indepen- 
dently of the AP error. It might anyway seem preferable to start 
with a grid twice as dense as required, fit the forward model on half 
the data and evaluate ILIUM on the other half. But if the smoothness 
requirement is met, this will anyway give a similar performance. 

The underlying issue here - relevant to all machine learning 
methods - is how to make a fair assessment of performance. Ul- 
timately all assessments are limited by the fact that training and 
evaluation data are drawn from a common grid. Yet it would be use- 
less to train the classifier on spectra generated by one set of stellar 
models and evaluate it on spectra generated by another, because this 
"performance" would reflect the intrinsic differences between the 
stellar models. This remains a quandary, also because tests based 
on synthetic spectra ignore the realities of non-Gaussian noise and 
cosmic scatter of real data. The most reliable assessment of perfor- 
mance would be to first determine APs of a set of high resolution 
spectra (using whatever method) based on a particular stellar model 
and define these APs as true. We would then degrade the spectra 
to the dispersion, line-spread-function and noise of real Gaia data 
and compare the ILIUM estimates with the true ones. Such tests are 
planned but are considerably more arduous than what's done here. 



APPENDIX B: COMPARATIVE PERFORMANCE OF A 
SUPPORT VECTOR MACHINE 

It is not the goal of this article to make a detailed comparison 
between the accuracy of ILIUM and conventional machine learn- 
ing methods, but a brief comparison is useful. I therefore applied 
a support vector machine (SVM) (e.g. Cortes & Vapnik 119951 
Burges 1998 1 to some of the problems presented in the article. The 
SVM is used to directly model the inverse problem by training on 
noisy data. Although the SVM training has a unique solution for a 
given set of data, it has three hyperparameters (the length scale pa- 
rameter 7 and two regularization parameters e and C) which must 
be optimized ("tuned"), which is just a higher-level training pro- 
cedure. I optimize these hyperparameters simply via a brute force 
search of a regular grid over the hyperparameters, typically with 
512 or 1000 models (8 or 10 values of each hyperparameter). To 
achieve this the data set must be split into three independent parts: 
the training set, a test set (used to select the best combination of 
hyperparameters) and the evaluation set (on which the final perfor- 
mance is calculated). (If we just use the same data for testing and 
evaluation - as is frequently done - then the results are somewhat 
better; unfairly so, because then the SVM is tuned on the very data 
set on which it is finally evaluated. I nonetheless did this for the TG 
problem because of the small amount of data: just 274 stars.) For 
each of the problems/data sets described in section [372| the set is 
randomly split into three, equal-sized disjoint parts to build these 
subsets. The SVM is tuned on noisy data, separately for each mag- 



Table Bl. Performance of an SVM on a selection of the problems reported 
in Table [2] The error statistic is the mean absolute error, e mac - Figures in 
bold indicate that the error is significantly larger (by more than a third) than 
the ILIUM error, and vice versa for numbers in italics. Note that the models 
for the TG problem used just a two-way split into train/test sets due to lack 
of data, which gives them optimistic results. 
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nitude, and separately for each AP (as an SVM can only model one 
output). 

The SVM results are summarized in Table IB II This obvi- 
ously conceals details, such as the existence of systematic errors 
in some cases. For example, T e fi is systematically underestimated 
for T c fj > 10 000 K and \ogg shows a systematic across the whole 
log g range in the TAG problem at G=18.5. Comparing with the IL- 
IUM results in Table [2] no very clear pattern emerges, with ILIUM 
superior to SVM in some problems and vice versa in others. When 
comparing performance star-by-star on a given problem, we do not 
see that one algorithm performs systematically better than the other 
in some parts of the AP space. Rather, one algorithm is just overall 
better. The fact that ILIUM is superior on the TG problems whereas 
SVM is superior on the TG-allmet problems suggests that SVM 
copes better with problems where there is more unmodelled vari- 
ance, although the better performance of ILIUM on the TM-dwarfs 
problem and TAG at G=15 does not support this suggestion. Pos- 
sibly it is when the variance is higher overall (lower SNR and/or 
unmodelled variance) that SVM gives smaller errors, which would 
be consistent with that method's approach to dealing with noise. A 
more detailed comparison is only warranted after further work has 
been put into optimizing the ILIUM algorithm, such as the conver- 
gence for noisy data or the update clipping procedures. 

SVM has the advantage of being much faster to apply once it 
has been tuned. As the speed of ILIUM is probably dominated by 
the nearest neighbour initialization, this could be replaced with an 
SVM. On the other hand, ILIUM provides AP uncertainty estimates, 
goodness-of-fit estimates (for outlier/poor solution detection) and 
allows one to inspect the relevance of each input in determining the 
output for every object (equation[8j. 



